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Q Optimal experimental design (OED) seeks experiments expected to yield the most useful 

data for some purpose. In practical circumstances where experiments are time-consuming or 
resource-intensive, OED can yield enormous savings. We pursue OED for nonlinear systems from 

^ a Bayesian perspective, with the goal of choosing experiments that are optimal for parameter 

inference. Our objective in this context is the expected information gain in model parameters, 
which in general can only be estimated using Monte Carlo methods. Maximizing this objective 
thus becomes a stochastic optimization problem. 

> 



This paper develops gradient-based stochastic optimization methods for the design of experi- 
ments on a continuous parameter space. Given a Monte Carlo estimator of expected information 
gain, we use infinitesimal perturbation analysis to derive gradients of this estimator. We are then 
00 able to formulate two gradient-based stochastic optimization approaches: (i) Robbins-Monro 

stochastic approximation, and (ii) sample average approximation combined with a determinis- 
ms^ tic quasi-Newton method. A polynomial chaos approximation of the forward model accelerates 
objective and gradient evaluations in both cases. We discuss the implementation of these opti- 
mization methods, then conduct an empirical comparison of their performance. To demonstrate 
CN desi gn in a nonlinear setting with partial differential equation forward models, we use the prob- 
. . lem of sensor placement for source inversion. Numerical results yield useful guidelines on the 
. ^ choice of algorithm and sample sizes, assess the impact of estimator bias, and quantify tradeoffs 
^ of computational cost versus solution quality and robustness. 
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1 Introduction 



Experimental data play a crucial role in the development of models — and the advancement of 
scientific understanding — across a host of disciplines. Some experiments are more useful than 
others, however, and a careful choice of experiments can translate to enormous savings of time 
and financial resources. Traditional experimental design methods, such as factorial and composite 
designs, are largely used as heuristics for exploring the relationship between input factors and 
response variables. Optimal experimental design, on the other hand, uses a model to guide the 
choice of experiments for a particular purpose, such as parameter inference, prediction, or model 
discrimination. Optimal design has seen extensive development for linear models endowed with 

* Corresponding author: yniarz@mit.edu, http://web.mit.edu/aeroastro/labs/uqlab/index.html, 77 Mas- 
sachusetts Avenue, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. 
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Gaussian distributions [1]. Extensions to nonlinear models are often based on linearization and 
Gaussian approximations [2, 3, 4], as analytical results are otherwise impractical or impossible 
to obtain. With advances in computational power, however, optimal experimental design for 
nonlinear systems can now be tackled directly using numerical simulation [5, 6, 7, 8, 9, 10, 11]. 

This paper pursues nonlinear experimental design from a Bayesian perspective (e.g., [12]). 
The Bayesian statistical approach [13, 14] provides a rigorous foundation for inference from noisy, 
indirect, and incomplete data, and a natural mechanism for incorporating physical constraints and 
heterogeneous sources of information. We focus on experiments described by a continuous design 
space, with the goal of choosing experiments that are optimal for Bayesian parameter inference. A 
useful objective function for this purpose is the expected information gain in model parameters [15, 
16] — or equivalently, the mutual information between parameters and observables, conditioned 
on the design variables. This objective can be derived in a decision theoretic framework, using the 
Kullback-Leibler divergence from posterior to prior as a utility function [4]. From the numerical 
perspective, however, it is a complicated quantity. In general, it must be approximated using a 
Monte Carlo method [6, 17]. Consequently, only noisy estimates of the objective function are 
available, and the optimal design problem becomes a stochastic optimization problem. 

There are many approaches for solving continuous optimization problems with stochastic ob- 
jectives. While some do not require the direct evaluation of gradients (e.g., Nelder-Mead [18], 
Kiefer-Wolfowitz [19], and simultaneous perturbation stochastic approximation [20]), other algo- 
rithms can use gradient evaluations to great advantage. Broadly, these algorithms involve either 
stochastic approximation (SA) [21] or sample average approximation (SAA) [22], where the lat- 
ter approach must also invoke a gradient-based deterministic optimization algorithm. Hybrids of 
the two approaches are possible as well. In either case, for model-based experimental design, one 
must employ gradients of the information gain objective described above. This objective function 
itself involves nested integrations over possible model outputs and the input parameter space, 
where the model output may be a functional of the solution of a partial differential equation. 
In many practical cases, the model may be essentially a black box; while in other cases, even if 
gradients can be evaluated with adjoint methods, using the full model to evaluate the expected 
information gain or its gradient is computationally prohibitive. 

We address these difficulties by constructing polynomial surrogates for the the model output, 
i.e., polynomial chaos expansions [23, 24, 25, 26, 27, 28, 29] that capture dependence on both 
uncertain parameters and design variables. Given a polynomial surrogate and the correspond- 
ing Monte Carlo estimator of the expected information gain, we use infinitesimal perturbation 
analysis to derive gradients of this estimator. We are then able to formulate two gradient-based 
stochastic optimization algorithms for nonlinear experimental design: (1) Robbins-Monro (RM) 
stochastic approximation, and (2) sample average approximation combined with the Broyden- 
Fletcher-Goldfarb-Shanno method (SAA-BFGS). We will discuss the implementation of both 
methods, then compare their performance. 

The Robbins-Monro algorithm [30] is one of the earliest and most widely used stochastic ap- 
proximation methods, and has become a prototype for many subsequent algorithms. It involves 
an iterative update that resembles steepest descent, except that it uses stochastic gradient in- 
formation. Sample average approximation (SAA) (also known as the retrospective method [31] 
or the sample-path method [32]) is a more recent approach, with theoretical analysis initially 
appearing in the 1990s [22, 32, 33]. Convergence rates and stochastic bounds, although useful, do 
not necessarily reflect empirical performance under finite computational resources and imperfect 
numerical optimization schemes. To the best of our knowledge, extensive numerical testing of 
SAA has focused on stochastic programming problems with special structure (e.g., linear pro- 
grams with discrete design variables) [34, 35, 36, 37, 38]. While numerical improvements to 
SAA have seen continual development (e.g., estimators of optimality gap [39, 40] and sample 
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size adaptation [41, 42]), the practical behavior of SAA in more general optimization settings is 
largely unexplored. This paper will therefore include a numerical assessment of SAA-BFGS in a 
nonlinear continuous-variable design setting. 

Sample average approximation is often compared to stochastic approximation methods such 
as RM. For example, Shapiro [43] suggests that SAA is more robust than SA because of sensitivity 
to step size choice in the latter. While acknowledging this shortcoming, Nemirovski et al. [44] 
presented a robust mirror-descent SA algorithm that, for certain classes of convex problems, 
reaches accuracy comparable to that of SAA in substantially less time. This paper will also 
compare SA and SAA, but from a practical and numerical perspective and in the context of 
optimal Bayesian experimental design. 

This paper is organized as follows. Section 2 introduces optimal Bayesian experimental design 
(§2.1) and extracts the underlying stochastic optimization problem (§2.2), then presents the RM 
(§2.2.1) and SAA-BFGS (§2.2.2) algorithms. The challenge of evaluating gradient information 
appropriate to each of these algorithms is described in Section 2.3. Section 3 and Section 4 
describe how to obtain gradients (or gradient estimators) for the experimental design objective 
using polynomial chaos expansions and infinitesimal perturbation analysis. Section 5 then ana- 
lyzes the numerical performance of RM and SAA-BFGS on an optimal sensor placement problem 
involving contaminant diffusion. Conclusions on the algorithms and the relative strengths of SA 
and SAA for optimal experimental design are provided in Section 6. 



2 Optimal Bayesian Experimental Design 
2.1 Background 

We are interested in choosing the "best" experiments^ from a continuously parameterized design 
space, for the purpose of inferring model parameters from noisy and indirect observations. In other 
words, we seek experiments that are optimal for parameter inference (in a sense to be precisely 
defined below), with inference performed in a Bayesian setting. In the problems considered here, 
the mean observations are nonlinear functions of the model parameters, and the observations and 
model parameters are continuous random variables. 
Bayes' rule describes the parameter update process: 

f fa\ A\ /Y|©,d(y|6',d)/e|d(0|d) 

= — wsw — ■ 

Here represents the uncertain parameters of interest, Y the observations, and d the design 
variables. Like the observations and parameters, the design parameters are continuous. Also 
/©|d is the prior density, /Y|©,d is the likelihood function, /e|Y,d is the posterior density, and 
/Y|d is the evidence. It is reasonable to assume that prior knowledge on does not vary with 
the design choice, leading to the simplification /©|(i(0|d) = f@{0). 

Taking the decision theoretic approach proposed by Lindley [15, 16], we use the Kullback- 
Leibler (KL) divergence [45, 46] from the posterior to the prior as a utility function, and take its 
expectation under the prior predictive distribution of the data to obtain an expected utility f/(d): 

7e|Y,d(0|y,d) 



[/(d) = / / /0|Y,d(0|y,d)ln 

jy Jh 



d0/Y|d(y|d)dy (2) 



lEvid [^KL (/e|Y,d(-|Y,d)||/e(.))] . 



^These design choices will be made all-at-once; this setup corresponds to batch or open-loop design. In contrast, 
sequential or closed-loop design allows the results of one set of experiments to guide the next set. Rigorous approaches 
to optimal closed-loop design are more challenging, and will not be tackled in this paper. 
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Here % is the support of f&{9) and y is the support of /Y|d(y|d)- Because the observation 
Y cannot be known before the experiment is performed, taking the expectation over the prior 
predictive /Y|d l^ts the resulting utihty function reflect the information gain on average, over all 
anticipated outcomes of the experiment. The KL divergence may be understood as information 
gain: larger KL divergence from posterior to prior implies that the data Y decrease entropy in 
by a larger amount, and hence are more informative for parameter inference. The expected 
utility U (d) is thus the expected information gain due to an experiment performed at conditions 
d, which is equivalent to the mutual information between the parameters and the observables 
y conditioned on d. A more detailed derivation and discussion can be found in [11]. 

Typically, the expected utility in (2) has no closed form (even if the predictive mean of the 
data is, for example, a polynomial function of 6). Instead, it must be approximated numerically. 
By applying Bayes' rule to the quantities inside and outside the logarithm in (2), and then 
introducing Monte Carlo approximations for the resulting integrals, we obtain the nested Monte 
Carlo estimator proposed by Ryan [6]: 



U{d) « ^7^, 



AI 



[d,9s,ys) = 



1 

N 



N 



In 



1=1 



M 



M 



E/Y|0,d(y^*¥^*'''U) 



where Os = {0^*^} U l^^*'-'^!, i = 1...N, j = 1 . . . M, are i.i.d. samples from the prior /©; 

and Ys = {y^*^}, i = 1 . . . N , are independent samples from the likelihoods /Y|©,d("l^^*^) d). The 
variance of this estimator is approximately A{d)/N + B(d)/NM and its bias is (to leading order) 
C(d)/M [6], where A, B, and C are terms that depend only on the distributions at hand. While 
the estimator Un,m is biased for finite M, it is asymptotically unbiased. 

Finally, the expected utility must be maximized over the design space T) to find the optimal 
experiment (s): 



(3) 



argmax Uid). 
de© 



(4) 



Since U can only be approximated by Monte Carlo estimators such as Un,m, optimization meth- 
ods for stochastic objective functions are needed. 



2.2 Stochastic optimization 

In this section we describe two gradient-based stochastic optimization approaches: Robbins- 
Monro stochastic approximation, and sample average approximation with the Broyden-Fletcher- 
Goldfarb-Shanno method. Both approaches require some flavor of gradient information, but they 
do not use the exact gradient of U{d). Calculating the latter is generally not possible, given that 
we only have a Monte Carlo estimator (3) of U{d). 

For simplicity, in this section only (§2.2), we will use a more generic notation to describe 
the stochastic optimization problem at hand. This will allow the essential ideas to be presented 
before tackling the additional complexities of the expected information gain estimator above. 
The problem to be solved is of the form: 



X = argmm 



in |/i(x) = Ew h{x,W) } 



(5) 



where x is the design variable, W is the (generally design-dependent) "noise" random variable, 
and h{x, w) is an unbiased estimate of the unavailable objective function h(x). 
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2.2.1 Robbins-Monro (RM) stochastic approximation 



The iterative update of the Robbins-Monro method is: 

Xk+i = Xk- akg{xk, w'), (6) 

where k is the iteration index and g{xk, w') is an unbiased estimator of the gradient (with respect 
to x) of h[x) evaluated at x^. In other words, E^^/ [g{x, W)] = 'Vxh{x), but g is not necessarily 
equal to V/i. Also, W' and W may, but need not, be related. The gain sequence Uk should satisfy 
the following properties 

oo oo 

Ofc = oo and o| < oo. (7) 

A;=0 k=0 

One natural choice, used in this study, is the harmonic step size sequence = l3/k, where /3 
is some appropriate scaling constant. For example, in the diffusion problem of Section 5, f3 is 
chosen to be 1.0 since the design space is [0, 1]^. With various technical assumptions on g and g, 
it can be shown that RM converges to the exact solution of (5) almost surely [21]. 

Choosing the sequence is often viewed as the Achilles' heel of RM, as the algorithm's 
performance can be very sensitive to step size. We acknowledge this fact and do not downplay the 
difficulty of choosing an appropriate gain sequence, but we will try to show that there exist logical 
approaches to selecting aj. that yield reasonable performance. More sophisticated strategies, 
such as search-then-converge learning rate schedules [47], adaptive stochastic step size rules [48], 
and iterate averaging methods [49, 21], have been developed and successfully demonstrated in 
applications. For simplicity, however, we will use only the harmonic step size sequence in this 
paper. 



2.2.2 Sample average approximation (SAA) 

Transformation to design-independent noise. The central idea of sample average ap- 
proximation is to reduce the stochastic optimization problem to a deterministic problem, by 
fixing the noise throughout the entire optimization process. In practice, if the noise W is design- 
dependent, it is first transformed to a design-independent random variable by moving all the 
design dependence into the function h. (An example of this transformation is given in Sec- 
tion 4.) The noise variables at different x then share a common distribution, and a common set 
of realizations is employed at all values of x. 

Such a transformation is always possible in practice, since the random numbers in any com- 
putation are fundamentally generated from uniform random (or really pseudo-random) numbers. 
Thus one can always transform W back into these uniform random variables, which are of course 
independent of x.^ For the remainder of this section (§2.2.2) we shall, without loss of generality, 
assume that W is independent of x. 



Reduction to a deterministic problem. SAA approximates the true optimization prob- 
lem in (5) with 



-^/i(x,u;i) 
i=i ) 



i^^g^^^\hN{x,Ws) = — } ^h{x,Wi)} , (8) 



^One does not need to go all the way to the uniform random variables; any higher-level "transformed" random 
variable, as long as it remains independent of x, suffices. 
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where Xs and h]\[{xs,Ws) are the optimal design and objective values under a particular set of N 
realizations of the random variable W, Wg = {'Wi}f=i- The same set of realizations is used for 
different values of x during the optimization process, thus making the minimization problem in 
(8) deterministic. (One can view this approach as an application of common random numbers.) 
A deterministic optimization algorithm can then be chosen to find approximation to x* . 

Estimates of h{xs) can be improved by using h]\f'{xs, Ws') instead of hi\f{xs,Ws), where hiyi{xs,Ws') 
is computed from a larger set of realizations Wg' = {wj}jLi with N' > N, in order to attain a 
lower variance. Finally, multiple (say T) optimization runs are often performed to obtain a sam- 
pling distribution for the optimal design values and the optimal objective values, i.e., x* and 
h]sf{x^g, wl), for t = 1 . . .T. The sets wl are independently chosen for each optimization run, but 
remain fixed within each run. Under certain assumptions on the objective function and the design 
space, the optimal design and objective estimates in SAA generally converge to their respective 
true values in distribution at a rate of [22, 33].'^ 

For the solution of a particular deterministic problem x* , stochastic bounds on the true optimal 
value can be constructed by estimating the optimality gap h{x\) — h{x*) [39, 40]. The first term can 

simply be approximated using the unbiased estimator h {x\ , w*, ) since Eiy__, hf^i {x\ , W^' ) = 

h{x\). The second term may be estimated using the average of the approximate optimal objective 
values across the T replicate optimization runs (based on u)*, rather than tf*;): 



1 ^ ^ 

hN = 7f^hN{xi,w\). (9) 



t=l 



This is a negatively biased estimator and hence a stochastic lower bound on h{x*) [39, 40, 50].^''^ 
The difference h at/ (x* , w\i ) —Hn is thus a stochastic upper bound on the true optimality gap 
h{x\) — h{x*). The variance of this optimality gap estimator can be derived from the Monte 
Carlo standard error formula [34]. One could then use the optimality gap estimator and its 
variance to decide whether more runs are required, or which approximate optimal designs are 
most trustworthy. 

Pseudocode for the SAA method is presented in Algorithm 1. At this point, we have reduced 
the stochastic optimization problem to a series of deterministic optimization problems; a suitable 
deterministic optimization algorithm is still needed to solve them. 

BFGS method. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [51] is a gradient- 
based method for solving deterministic nonlinear optimization problems, widely used for its ro- 
bustness, ease of implementation, and efficiency. It is a quasi-Newton method, iteratively up- 
dating an approximation to the (inverse) Hessian matrix from objective and gradient evaluations 
at each stage. Pseudocode for the BFGS method is given in Algorithm 2. In the present im- 
plementation, a simple backtracking line search is used to find a stepsize that satisfies the first 
(Armijo) Wolfe condition only. BFGS is shown to converge super-linearly to a local minimum 



^More precise properties of these asymptotic distributions depend on properties of the objective and the set of 
optimal solutions to the true problem. For instance, in the case of a singleton optimum x* , the SAA estimates 
^Ni^si •) converge to a Gaussian with variance Va,Tw[h{x* , W)]/N. Faster convergence to the optimal objective value 
may be obtained when the objective satisfies stronger regularity conditions. The SAA solutions Xg are not in general 
asympotically normal, however. Furthermore, discrete probability distributions lead to entirely different asymptotics 
of the optimal solutions. 

'^Short proof from [50]: For any x G A", we have that Ew^ hN{x,Ws) = h(x), and that hpf{x,wl) > 

mui^,^xhN{x',wl). Then h{x) =Ew, hN{x,Ws) >Ew, '^niii.j,' hN{x' ,Ws) = Evv, hN{xl,Ws) ^Ew^I^n]- 
'"'The bias decreases monotonically with N [39]. 
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Algorithm 1: SAA method in pseudocode. 



Set optimality gap tolerance rj and number of replicate optimization runs T; 
t = l; 

while optimality gap estimate > rj and t <T do 
Sample the set wl = {wj}fLi; 

Perform a deterministic optimization run and find x*; 
Sample the larger set wl, = {w*}^]^ where N' > A; 
Compute hN'{xl,wl,) = ^ Y.f=i h 
Estimate the optimality gap and its variance; 
t = t + l; 
end 

Output the sets and {hN'{xl,wl,)}J^^ for post-processing; 



if a quadratic Taylor expansion exists near that minimum [51]. The limited memory BFGS (L- 
BFGS) [51] method can also be used when the design dimension becomes very large (e.g., more 
than 10^), such that the dense inverse Hessian cannot be stored explicitly. 



Algorithm 2: BFGS algorithm in pseudocode. In this context, hN{x,wl) is the deterministic 
objective function we want to minimize. 

Initialize starting point xq, inverse Hessian approximation Hq, gradient termination tolerance e; 
Initialize any other termination conditions and parameters; 
k = 0; 

while II V/?.Ar(xfc, w*) \\ > e and other termination conditions are not met do 
Compute search direction pk = —H^V h^^Xk, wl); 
Find acceptable stepsize ak via line search; 
Update position Xk+i = Xk + akPk] 

Define vectors Sk = Xk+i — Xk and Uk = VhN{xk+i, wl) — Vh^i^Xk, wl) ; 



Update inverse Hessian approximation Hk+i = (l — ) Hk (l — ) + 

k = k + l; 
end 

Output X* = Xk] 



2.3 Application to optimal design 

The main challenge in applying the aforementioned stochastic optimization algorithms to optimal 
Bayesian experimental design is the lack of readily-available gradient information. For RM, we 
need an unbiased estimator of the gradient of the expected utility, i.e., g in (6). For SAA-BFGS, 
we need the gradient of the finite-sample Monte Carlo approximation of the expected utility, i.e., 
VhN{-,wl). 

We address these needs by introducing two concepts: 

1. A simple surrogate model, based on polynomial chaos expansions (see Section 3), replaces 
the often computationally-intensive forward model. The purpose of the surrogate is twofold. 
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First, it allows the nested Monte Carlo estimator (3) to be evaluated in a computationally 
tractable manner. Second, its polynomial form allows the gradient of (3), V/iAr(-, w*), to 
be derived analytically. These gains come at the expense of introducing additional error 
via the polynomial approximation of the original forward model, however. In other words, 
given a surrogate for the forward model and the resulting expected information gain, we can 
derive exact gradients of a Monte Carlo approximation of this expected information gain, 
and use these gradients in SAA. 

2. Infinitesimal perturbation analysis (see Section 4) applied to (2), along with the estimator 
in (3) and the polynomial surrogate model, allows the analytical derivation of an unbiased 
gradient estimator g, as required for the RM approach. 

3 Polynomial chaos surrogates 
3.1 Background 

This section introduces the first of two computational tools used to address the challenges de- 
scribed in Section 2.3. Polynomial expansions will be used to mitigate the cost of repeated for- 
ward model evaluations. Later (see Section 4) they will also be used to help evaluate appropriate 
gradient information for stochastic optimization methods. 

Mathematical models of the experiment enter the inference and design formulation through 
the likelihood function /Y|©,d- Fo^' example, a simple likelihood function might allow for an 
additive discrepancy E between experimental observations and model predictions: 

Y = G(0,d) + E. (10) 

Here G{0,d) is the "forward model" describing the experiment; it is a function that maps 
both the design variables and the parameters into the data space. The discrepancy E is of- 
ten taken to be a Gaussian random variable, but is by no means limited to this; we will use /e 
to denote its probability density. Computationally intensive forward models can render Monte 
Carlo estimation of the expected information gain impractical. In particular, drawing a sam- 
ple from /Y|©,d(y|^5 fl) requires evaluating G at a particular {0,d). Evaluating the density 
/Y|©,d(y|^) d) = /E(y — G{6,d)) again requires evaluating G. 

To make these calculations tractable, one would like to replace G with a cheaper "surrogate" 
model that is accurate over the entire prior support Ti and the entire design space T>. Many 
options exist, ranging from projection-based model reduction [52, 53] to spectral methods based 
on polynomial chaos (PC) expansions [23, 24, 25, 26, 27, 28, 29, 54]. The latter approaches do 
not reduce the internal physics of the deterministic model; rather, they exploit regularity in the 
dependence of model outputs on uncertain input parameters and design variables. 

Polynomial chaos has seen extensive use in a range of engineering applications (e.g., [55, 56, 57, 
58]) including parameter estimation and inverse problems (e.g., [59, 60, 61]). More recently, it has 
also been used in open-loop optimal Bayesian experimental design [10, 11], with excellent accuracy 
and multiple order-of-magnitude speedups over direct evaluations of forward model. Unlike the 
present work, however, our earlier study [11] used only gradient-free stochastic optimization 
methods (Nelder-Mead and simultaneous perturbation stochastic approximation). 
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3.2 Formulation 



Any random variable Z with finite variance can be represented by an infinite series 

oo 

Z= J^ai^i(Hi,H2,...), (11) 

|i|=0 

where i = {ii,i2, ■ ■ ■) , ij E No, is an infinite-dimensional multi-index; |i| = ii -|- ^2 -|- . . . is the li 
norm; Oi G M are the expansion coefficients; Hj are independent random variables; and 

oo 

*i(Hi,H2,...) = n^*.(^^-) (12) 

are multivariate polynomial basis functions [25]. Here il^i- is an orthogonal polynomial of order 
ij in the variable Hj, where orthogonality is with respect to the density of Hj, 

Eh [iJn.{E)M^)] = [ i^n^iOMOUO di = 5^,n^^ [Vi(S)] , (13) 



and T is the support of fE{(,)- The expansion (11) is convergent in the mean-square sense [62]. 
For computational purposes, the infinite sum in (11) must be truncated to some finite stochastic 
dimension and a finite number of polynomial terms. A common choice is the "total-order" 
truncation |i| < p, but other truncations that retain fewer cross terms, a larger number of cross 
terms, or anisotropy among the dimensions are certainly possible [54]. 

In the optimal Bayesian experimental design context, the model outputs depend on both the 
parameters and the design variables. Constructing a new polynomial expansion at each value of d 
encountered during optimization is generally impractical. Instead, we can construct a single PC 
expansion for each component of G, depending jointly on and d [11]. To proceed, we assign 
one stochastic dimension to each component of and one to each component of d. Further, we 
assume an affine transformation between each component of d and the corresponding Hj; any 
realization of d can thus be uniquely associated with a vector of realizations ^j. Since the design 
variables will usually be supported on a bounded domain (e.g., inside some hyper- rectangle), the 
corresponding Hj are endowed with uniform distributions. The associated univariate -i/^j are thus 
Legendre polynomials. These distributions effectively define a uniform weight function over the 
design space V that governs where the L'^-convergent PC expansions should be most accurate.^ 

Constructing the PC expansion involves computing the coefficients Oj. This computation 
generally can proceed via two alternative approaches, intrusive and non-intrusive. The intrusive 
approach results in a new system of equations that is larger than the original deterministic system, 
but it needs be solved only once. The difficulty of this latter step depends strongly on the character 
of the original equations, however, and may be prohibitive for arbitrary nonlinear systems. The 
non-intrusive approach computes the expansion coefficients by directly projecting the quantity 
of interest (e.g., the model outputs) onto the basis functions ^'j. One advantage of this method 
is that the deterministic solver can be reused and treated as a black box. The deterministic 
problem then needs to be solved many times, but typically at carefully chosen parameter and 
design values. The non-intrusive approach also offers flexibility in choosing arbitrary functionals 



''In the present context, it is appropriate to view d as a deterministic design variable. Since the stochastic optimiza- 
tion algorithms used later all involve some level of randomness, however, the d values encountered during optimization 
may also be viewed as realizations from some probability distribution. Tlris distribution, if known, could replace the 
uniform distribution and define a more efficient weighted norm; however, it is almost always too complex to extract 
in practice. 
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of the state trajectory as observables; these functionals may depend smoothly on H even when 
the state itself has a less regular dependence. Here, we will employ a non-intrusive approach. 
Applying orthogonality, the PC coefficients are simply 



_ Es [G.(0(5),d(5))vl/i(5)] _ GM^), dim, {^)M^)d^ .... 

where Gc,i is the coefficient of ^'i for the cth component of the model outputs. Analytical 
expressions are available for the denominators Es [^'j (H)] , but the numerators must be evaluated 
numerically. When the evaluations of the integrand (and hence the forward model) are expensive 
and Us is large, an efficient method for numerical integration in high dimensions is essential. 

To evaluate the numerators in (14), we employ Smolyak sparse quadrature based on one- 
dimensional Clenshaw-Curtis quadrature rules [63]. Care must be taken to avoid significant 
aliasing errors when using sparse quadrature to construct polynomial approximations, however. 
Indeed, it is advantageous to recast the approximation as a Smolyak sum of constituent full- 
tensor polynomial approximations, each associated with a tensor-product quadrature rule that is 
appropriate to its polynomials [64, 54]. This type of approximation may be constructed adaptively, 
thus taking advantage of weak coupling and anisotropy in the dependence of G on and d. More 
details can be found in [54]. 

At this point, we may substitute the polynomial approximation of G into the likelihood func- 
tion /Y|e,di which in turn enters the expected information gain estimator (3). This enables fast 
evaluation of the expected information gain. The computation of appropriate gradient informa- 
tion is discussed next. 



4 Infinitesimal Perturbation Analysis 

This section applies the method of infinitesimal perturbation analysis (IPA) [65, 66, 67] to con- 
struct an unbiased estimator g of the gradient of the expected information gain, for use in RM. 
The same procedure yields the gradient V/ijv,m(") w^s) of a finite-sample Monte Carlo approxi- 
mation of the expected information gain, for use in SAA. The central idea of IPA is that under 
certain conditions, an unbiased estimator of the gradient of a function can be obtained by simply 
taking the gradient of an unbiased estimator of the function. We apply this idea in the context 
of optimal Bayesian experimental design. 

The first requirement of IPA is the availability of an unbiased estimator of the function. 
Unfortunately, as described in Section 2.1, Un,m in (3) is a biased estimator of U for finite M [6]. 
To circumvent this technicality, let us optimize the following objective function instead of U : 

?7Ar,Af(d, Os,ys)f&„Ys\d{ds,ys\d) dOs dys 



Vs J-Hs 

{N,M) 

UNMid, Os, Ys) n /Y|0,d(y^'^l^^'\ d)/0(0»)/e(0(^'^')) dOs dy,(15) 

-^^^ {ij)=(i,i) 

where Tis x is the support of the joint density /©^,Ys|d(^s) Ysl^). Our original estimator Un,m 
is now unbiased for the new objective by construction! The tradeoff, of course, is that the 
function being optimized is no longer the true U. But it is consistent in that [/^(d) — )• U{d) as 
M — )• oo, for any > 0. (To illustrate this convergence, realizations of U^^m, i-e., Monte Carlo 
approximations of Um, are plotted in Figure 2 for varying M.) 
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The second requirement of IPA comprises conditions allowing an unbiased gradient estimator 
to be constructed by taking the gradient of the unbiased function estimator. Standard conditions 
(see, for example, [67]) require that the random quantity (e.g., Un,m) be almost surely continuous 
and differentiable. Here, because Um,m is parameterized by continuous random variables that 
have densities with respect to Lebesgue measure, we can take a perspective that relies on Leibniz's 
rule with the following conditions: 

1- Um^m and [Un,k1^ ai'e continuous over the product space of design variables and random 

variables, D x Tig x ys', 
2. the density of the "noise" random variable is independent of d. 

The first condition supports the interchange of differentiation and integration according to 
Leibniz's rule. This condition might be difficult to verify in arbitrary cases, but the use of finite- 
order polynomial forward models and continuous distributions for the prior and observational 
noise ensures that we meet the requirement. 

The second condition is needed to preserve the form of the expectation. If it is violated, differ- 
entiation with respect to d must be performed on the /©^,Ys|d(^s) Ysl^l) term as well via the prod- 
uct rule, in which case the additional term Jy {7Af,Af(d, Og, Ys) V [/©j,,Ys|d(^s) Ysl^)] dOg dy^ 
would no longer be an expectation with respect to the original density. The likelihood-ratio 
method may be used to restore the expectation [68, 67], but it is not pursued here. Instead, 
it is simpler to transform the noise to a design-independent random variable as described in 
Section 2.2.2. 

In the context of optimal Bayesian experimental design, the outcome of the experiment Y is a 
stochastic quantity that depends on the design d. From the stochastic optimization perspective, 
Y is thus a noise variable. To demonstrate the transformation to design-independent noise, we 
assume a likelihood where the data result from an additive Gaussian perturbation to the forward 
model: 

Y = G(0,d) + E 

= G(0,d) + C(0,d)Z. (16) 

Here C is a diagonal matrix with non-zero entries reflecting the dependence of the noise standard 
deviation on other quantities, and Z is a vector of i.i.d. standard normal random variables. For 
example, "10% Gaussian noise on the cth component" would translate to Cc,j = (^cjO.l|Gc(0, d)|, 
where 5ci is the Kronecker delta function. For other forms of the likelihood, the right-hand side 
of (16) is simply replaced by a generic function of 0, d, and some random variable Z. Here, 
however, we will focus on the additive Gaussian form in order to derive illustrative expressions. 

By extracting a design-independent random variable Z from the noise term E = C(0,d)Z, 
we will satisfy the second condition above. The design-dependence of Y is incorporated into 
Un,m by substituting (16) into (3): 

N 



C/;v,M(d,0„z,) = l5^{ln[/Y|0,d(G(0«,d) + C(0W,d)z«|0«,d 

i=l 

M 

V7 E ( G iO^'^ , d) + C(0« , d)z« , d 



M 



, (17) 



where Zg = {z^*)}. The new noise variables are now independent of d. The samples of y^*) 
drawn from the likelihood are instead realized by drawing z^*^ from A^(0,I), then multiplying 
these samples by C and adding them to the model output. 
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With all conditions for IPA satisfied, an unbiased estimator of the gradient of Um, corre- 
sponding to g in (6), is simply Vd?7Ar,M(d, 0s, z^) since 



E 



VdC/iv,M(d,0s,Z, 



/ / VdUN,M{d,6s,Zs)f&^,zA^s,Zs)d0sdZs 

JZs J&s 

{7Ar,A/(d, 0s, Zs)/©,,Z, (0s, Zs) dOg dz^ 



V 



Vdt/Af(d), 



f^Ar,M(d, @s, Zs 



(18) 



where Zg is the support of fz^{zs). This gradient estimator is therefore suitable for use in RM. 

The gradient of the finite-sample Monte Carlo approximation of C/(d), i.e., VhN,M{',wl) used 
in SAA, takes exactly the same form. The only difference between the two is that g lets ©s and Zs 
be random at every iteration of the optimization process. When used as S/h]\f^M{',wi), ®s and Zs 
are frozen at some realization throughout the optimization process. In either case, these gradient 
expressions contain derivatives of the likelihood function and thus derivatives VdG(0,d). When 
G is replaced with a polynomial expansion, these derivatives can be computed inexpensively. 
Detailed derivations of the gradient estimator using orthogonal polynomial expansions can be 
found in Appendix A. 



5 Source Inversion Problem 
5.1 Governing equations 

We demonstrate the optimal Bayesian experimental design formulation and our stochastic op- 
timization tools on a two-dimensional contaminant identification problem. The goal is to place 
a single sensor that yields maximum information about the location of the contaminant source. 
Contaminant transport is governed by a scalar diffusion equation on a square domain: 

^ = V2u; + 5(x,,c,x,t), xGA'=[0,l]2, (19) 

where w{x, t; Xsrc) is the space-time concentration field parameterized by the coordinate of the 
source center Xgrc- We impose homogeneous Neumann boundary conditions 

Viu • n = on dX, (20) 

along with a zero initial condition 

u;(x,0;xsrc) = 0. (21) 
The source function has a Gaussian spatial profile 



(22) 

0, t>T 



where s, h, and r are known (prescribed) source intensity, width, and shutoff time parameters, 
respectively, and Xgrc = (0x,0y) is the unknown source location that we would ultimately like 
to infer. The design variable is the location of a single sensor, Xsensor = {dx,dy), and the mea- 
surement data comprise five noisy point observations of w at the sensor location and at 
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five equally-spaced sample times. For this study, we choose s = 2.0, h = 0.05, r = 0.3; a uniform 
prior G^:, Qy ^ U (0, 1); and an additive error model Yi = w (xgenson ^ij ; '^si:c)+ Ei,i = 1 ... 5, such 
that the Ei are zero-mean Gaussian random variables, mutually independent given Xgensor and 
Xgrc) each with standard deviation cij = 0.1 -|- 0.1 \w (xgensoD ii; Xgrc)!- I^i other words, the error 
associated with the data has a "floor" value of 0.1 plus an additional contribution that is 10% of 
the signal. The sensor may be placed anywhere in the square domain, such that the design space 
is {dx,dy) G [0, 1]^. Figure 1 shows an example concentration profile and measurements. 

Evaluating the forward model thus requires solving the partial differential equation (19) at 
fixed realizations of 6 = Xsrc and extracting the solution field at the design location d = Xgensor- 
We discretize (19) using 2nd-order centered differences on a 25 x 25 spatial grid and a 4th-order 
backward differentiation formula for time integration. As described in Section 3, we replace 
the full forward model with a polynomial chaos surrogate, for computational efficiency. To this 
end, we construct a Legendre polynomial approximation of the forward model output over the 
4-dimensional joint parameter and design space, using a total-order polynomial truncation of 
degree 12. This results in an extremely accurate surrogate, with relative errors ranging from 
6 X 10-3 to 10"*^. 

The optimal Bayesian experimental design formulation now seeks the sensor location Xgg^g^^ 
such that when the experiment is performed, on average — i.e., averaged over all possible source 
locations according to the prior, and over all possible resulting concentration measurements ac- 
cording to the likelihood — the five concentration readings {1^}^^^ yield the greatest information 
gain from prior to posterior. 

5.2 Results 

5.2.1 Objective function 

Before we present the results of numerical optimization, we first explore the properties of the 
expected information gain objective. Numerical realizations of Un,m for N = 1001 and M = 2, 
11, 101, and 1001 are shown in Figure 2. These plots can be interpreted as 1-sample Monte Carlo 
approximations of Um = IE[?7Ar^Af]i or equivalently, as Z-sample Monte Carlo approximations of 
Um = ^[U(n/1),m]- As N grows, Un,m becomes a better approximation to Um and as M grows, 
U M becomes a better approximation to U . The figures show that values of Um,m increase when 
M increases (for fixed N), suggesting a negative bias at finite M. At the same time, the objective 
becomes less flat in d; since U is certainly closer to the M = 1001 surface than the M = 2 surface, 
these results suggest that U is not particularly flat in d. This feature of the current design problem 
is encouraging, since stochastic optimization problems with higher curvature can be more easily 
solved; in the context of SA, for example, they effectively have a higher signal-to-noise ratio. 

The expected information gain objective inherits symmetries from the square, as expected 
from the physical nature of the problem. The plots also suggest a smooth albeit non-convex 
underlying objective ?7, with inflection points lying on an interior circle and four local maxima 
symmetrically located at the corners of the design space. The best placement for a single sensor 
is therefore at the corners of the design space, while the worst placement is at the center. The 
reason for this perhaps counter-intuitive result is that the diffusion process is isotropic: a series of 
concentration measurements can only determine the distance of the source from the sensor, not its 
orientation. The posterior distribution thus resembles an annulus of constant radius surrounding 
the sensor. A sensor placement that minimizes the area of these annuli, averaged over all possible 
source locations according to the prior, tends to be optimal. In this problem, because of the 
domain geometry and the magnitude of the observational noise, these optimal locations happen 
to be the furthest points from the domain center, i.e., the corners. 
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Figure 3 shows posterior probability densities for the source location, under different sensor 
placements, given data generated from a "true" source centered at Xgi-c = (0.09,0.22). The pos- 
terior densities are evaluated using the polynomial chaos surrogate, while the data are generated 
by directly solving the diffusion equation on a denser (101 x 101) spatial grid than before and 
then adding the Gaussian noise described in Section 5.1. Note that the posteriors are extremely 
non-Gaussian. Moreover, they generally include the true source location, but do not center on it. 
Reasons for not expecting the posterior mode to match the true source location are twofold: first, 
we have only 5 measurements, each perturbed with a relatively significant random noise; second, 
there is model error, due to mismatch between the polynomial chaos approximation constructed 
from the coarser spatial discretization of the PDE and the more finely-discretized PDE model 
used to simulate the data.''*^ For this source configuration, it appears that a sensor placed at any 
of the corners yields a "tighter" posterior than a sensor placed at the center. But we must keep in 
mind that this result is not guaranteed for all source locations and data realizations; it depends 
on where the source actually is. (Imagine, for example, if the source happened to be very close to 
the center of the domain; then the sensor at (0.5, 0.5) would yield the tightest posterior.) What 
the optimal experimental design method yields is the optimal sensor placement averaged over the 
prior distribution of the source location and the predictive distribution of the data. 



5.2.2 Stochastic optimization results 

We now analyze the optimization results, first assessing the behavior of the two stochastic opti- 
mization methods individually, and then comparing their performance. 

Recall that the RM algorithm is essentially a steepest-ascent method (since we are maximizing 
the objective) with a stochastic gradient estimate. Figures 4, 5, and 6 each show four sample 
RM optimization paths overlaid on the Un,m surfaces from Figure 2. The optimization does 
not always proceed in an ascent direction, due to the noise in the gradient estimate, but even 
a noisy gradient can be useful in eventually guiding the algorithm to regions of high objective 
value. Naturally, fewer iterations are needed and good designs are more likely to be found when 
the variance of the gradient estimator is reduced by increasing and M. Note that one must 
be cautious not to over-generalize from these figures, since the paths shown in each plot are not 
necessarily representative. Instead, their purpose is to provide intuition about the optimization 
mechanics. Data derived from many runs are more appropriate performance metrics, and will be 
used later in this section. 

For SAA-BFGS, each choice of the sample set to* yields a different deterministic objective; 
example realizations of this objective surface are shown in Figures 7, 8, and 9. For each realization, 
a local maximum is found efficiently by the BFGS algorithm, requiring only a few (usually less 
than 10) iterations. For each set of results corresponding to a particular (i.e., each of Figures 
7-9), the random numbers used for smaller values of M are proper subsets of those used for 
larger M. We thus expect some similarity and a sense of convergence among the subplots in each 
figure. Note also that when A^ is low, realizations of the objective can be extremely different 
from Figure 2 (for example, the plots in Figure 7 have local maxima near the center of the 
domain), although improvement is observed as A^ is increased. In general, each deterministic 
problem in SAA can have very different features than the underlying objective function. None of 
the realizations encountered here has maxima at the corners, or is even symmetric. Nonetheless, 

^Indeed, there are two levels of model error: (1) between the PC expansion and the PDE model used to construct 
the PC expansion, which has a Ax = Ay = 1/24 spatial discretization; (2) between this PDE model and the more 
finely discretized (Ax = Ay = 1/100) PDE model used to simulate the noisy data. 

^Model error is an extremely important aspect of uncertainty quantification [13], but its treatment is beyond the 
scope of this study. Understanding the impact of model error on optimal experimental design is an important direction 
for future work. 
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when sampling over many SAA subproblems, even a low N can provide reasonably good results. 
This will be shown in Tables 1 and 2, and discussed in detail below. 

To compare the performance of RM and SAA-BFGS, 1000 independent runs are conducted for 
each algorithm, over a matrix of N and M values. We make reasonable choices for the numerical 
parameters in each algorithm (e.g., gain schedule scaling, termination criteria) leading to similar 
run times. Histograms of the final design parameters (sensor positions) resulting from each set 
of 1000 optimization runs are shown in Table 1. The top figures in each major row represent RM 
results, while the bottom figures in each major row correspond to SAA-BFGS results. Columns 
correspond to different values of M. It is immediately apparent that more designs cluster at 
the corners of the domain as and M are increased. For the case with the largest number of 
samples {N = 101 and M = 1001), each corner has around 250 designs, suggesting that higher 
sample sizes cannot further improve the optimization results. An "overlap" in quality across the 
different N cases is also observed: for example, results of the N = 101, M = 2 case are worse 
than those of the = 11, M = 1001 case. A balance is thus needed in choosing samples sizes A^ 
and M, and it is not ideal to heavily favor sampling either the inner or outer Monte Carlo loop 
in Uis[,M- Overall, comparing the RM and SAA-BFGS plots at intermediate values of M and A^, 
we see that RM has a slight advantage over SAA-BFGS by placing more designs at the corners. 

The distribution of final designs alone does not reflect the robustness of the optimization 
results. For example, if U is very flat near the optimum, then suboptimal designs need not be 
very close to the true optimum in the design space to be considered good designs in practice. 
To evaluate robustness, a "high-quality" objective estimate ?7iooi,iooi is computed for each of 
the 1000 final designs considered above. The resulting histograms are shown in Table 2, where 
again the top subrows are for RM and the bottom subrows are for SAA-BFGS, with the results 
covering a full range of N and M values. In keeping with our previous observations, performance 
is improved as A^ and M are increased — in that the mean (over the optimization runs) expected 
information gain increases, while the variance in the expected information gain decreases. Note, 
however, that even if all 1000 optimization runs produced identical final designs, this variance will 
not reach zero, as there exists a "floor" corresponding to the variance of the estimator f/iooi,iooi- 
This minimum variance can be observed in the histograms of the RM results with A^ = 101 and 
M = 101 or 1001. 

One interesting feature of the histograms in Table 2 is their bimodality. The higher mode 
reflects designs near the four corners, while the lower mode encompasses all other suboptimal 
designs. As A^ or M increase, we observe a transfer of probability mass from the lower mode to 
the upper mode. However, the sample sizes are not large enough for the lower mode to completely 
disappear for most cases; it is only absent in the two RM cases with the largest sample sizes. 
Overall, the histograms are similar in shape for both algorithms, but RM appears to produce less 
variability in the expected information gain, particularly at high A^ values. 

Table 3 shows histograms of optimality gap estimates from the 1000 SAA-BFGS runs. Since 
we are dealing with a maximization problem (for the expected information gain), the estimator 
from §2.2.2 is reversed in sign, such that the upper bound is now and the lower bound 
is hNi{x\,w\i). The lower bound must be evaluated with the same inner-loop Monte Carlo 
sample size M used in the optimization run in order to represent an identically-biased underlying 
objective; hence, the lower bound values will not be the same as the "high-quality" objective 
estimates f/iooi,iooi discussed above. From the table, we observe that as A^ increases, values of 
the optimality gap estimate decrease. This is a result of the lower bound rising with A^ (since 
the optimization is better able to find designs in regions of large Um^ e.g., corners of the domains 
in Table 1), and the upper bound simultaneously falling (since its positive bias monotonically 
decreases with A^ [39]). Consequently, both bounds become tighter and the gap estimates tend 
toward zero. On the other hand, as M increases, the variance of the gap estimates increases. 



15 



Since the upper bound (/iat) is fixed for a given set of SAA runs, the spread is only affected by the 
variabihty of the lower bound. Indeed, from Figure 2, it is apparent that the objective becomes 
less flat as M increases, with the highest gradients (considering the good design regions only) 
occurring at the corners. This translates to a higher sensitivity, as a small "imperfection" in the 
design would lead to larger changes in objective estimate; one then would expect the variation of 
hp^i{x\,w\i) to become higher as well, leading to greater variance in the gap estimates. Finally, as 
M increases, the histogram values tend to increase, but they increase more slowly for larger values 
of N . Some intuition for this result may be obtained by considering the relative rates of change 
of the upper and lower bounds with respect to M, given different values of N . Again referring 
to Figure 2, the objective values generally increase with M, indicating an increase of the lower 
bound. This increase should be more pronounced for larger N ^ since the optimization converges 
to designs closer to the corners, where, as mentioned earlier, the objective has larger gradient. 
The upper bound increases with M as well, as indicated by the contour levels in Figures 7, 8, 
and 9. But this rate of increase is observed to be slowest at the highest (i.e., in Figure 9). 
Combining these two effects, it is reasonable that as A^ increases, the gap estimate will increase 
with M at a slower rate. 

Can the optimality gap be used to choose values of M and A? For a fixed M, we certainly 
have convergence as A increases, and the gap estimate can be a good indicator of solution quality. 
However, because different values of M correspond to different objective surfaces (due to the bias 
of Un^m), the optimality gap is unsuitable for comparisons across different values of M; indeed, in 
our example, even though solution quality is improved with M, the gap estimates appear looser 
and noisier. 

Another performance metric we extract from the stochastic optimization runs is the number 
of iterations required to reach a solution; histograms of iteration number for RM and SAA, 
for the same matrix of M and A^ values, are shown in Table 4. At low sample sizes, many of 
the SAA-BFGS runs take only a few iterations, while almost all of the RM runs terminate at 
the prescribed upper limit of 50 steps. This difference again reflects the efficiency of BFGS for 
deterministic optimization problems. As N and M are increased, the histograms show a "transfer 
of mass" from higher iteration numbers to lower iteration numbers, coinciding somewhat with 
the bimodal behavior described previously. The reduction in iteration number with increased 
sample size implies that an n— fold increase in sample size leads to an increase in computational 
time that is often much less than a factor of n. Accounting for this sublinear relationship when 
allocating computational resources, especially if samples can be drawn in parallel, can lead to 
substantial savings. Although SAA-BFGS generally requires fewer iterations, each iteration takes 
longer than a step of RM. RM thus offers a higher "resolution" in run times, potentially giving 
more freedom to the user in stopping the algorithm. RM thus becomes more attractive as the 
evaluation of the objective function becomes more expensive. 

As a single integrated measure of the quality of the stochastic optimization solutions, we 
evaluate the following mean square error (MSE): 



where d*, t = 1 . . . T, are the final designs from a given optimization algorithm, and U is the 
true optimal value of the expected information gain. Since the true optimum is unavailable in this 
study, U'^'^^ is taken to be the maximum value of the objective over all runs. Recall that the MSE 
combines the effects of bias and variance; here it reflects the variance in objective values plus 
the difference (squared) between the mean objective value and the true optimum, calculated via 
T = 1000 replicated optimization runs. Figure 10 relates solution quality to computational effort 
by plotting the MSE against average computational time (per run). Each symbol represents a 




(23) 



t=i 
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particular value of (x, Q, and □ represent = 1, 11, and 101, respectively), while the four 
different M values are reflected through the average run times. These plots confirm the behavior 
we have previously encountered. Solution quality generally improves (lower MSE) with increasing 
sample sizes, although a balanced allocation of samples must be chosen. For instance, a large N 
with small M can yield inferior solutions to a smaller N with larger M; on the other hand, for 
any given N, continued increases in M beyond some threshold yield minimal improvements in 
MSE. The best sample allocation is described by the minimum of all the curves. We highlight 
these "optimal fronts" in light red for RM and in light blue for SAA-BFGS. Monte Carlo error in 
the "high-quality" estimator ?7iooi,iooi may also be reflected in the non-zero MSE asymptote for 
the high- A/' RM cases. 

According to Figure 10, RM outperforms SAA-BFGS by consistently achieving smaller MSE 
for a given computational effort. One should be cautious, however, in generalizing from these 
numerical experiments. The advantage of RM is relatively small, and other factors such as code 
optimization, choices of algorithm parameters, and of course the experimental design problem 
itself can affect or even reverse this advantage. 

6 Conclusions 

This paper has explored the stochastic optimization problem arising from a general nonlinear for- 
mulation of optimal Bayesian experimental design. In particular, we employed an objective that 
reflects the expected information gain in model parameters due to an experiment, and formulated 
two gradient-based approaches to stochastic optimization in this context: Robbins-Monro (RM) 
stochastic approximation, and sample average approximation (SAA) coupled with BFGS. Both 
of these algorithms require gradient information derived from Monte Carlo approximations of the 
objective: an unbiased gradient estimator in the former case, and gradients of a finite-sample 
Monte Carlo estimate in the latter case. Methods for extracting this gradient information must 
contend with an estimator of expected information gain that is not a simple Monte Carlo sum, but 
rather contains nested Monte Carlo estimates. It is therefore expensive to evaluate, and biased 
for finite inner-loop sample sizes. To circumvent these challenges, we approximate the forward 
model embedded in the likelihood function with a polynomial chaos expansion, and maximize 
the expected information gain computed via this approximation instead. Gradient information 
is readily extracted from the polynomial chaos expansion, with the help of a simple perturbation 
analysis. 

We analyze the performance of the two stochastic optimization approaches using the problem 
of sensor placement for source inversion, cast as optimal experimental design over a continuous 
design space. Numerical experiments, performed over a matrix of inner- and outer-loop sample 
sizes, examine the impact of bias and variance in the objective function and gradient estimates on 
the efficiency of the optimization algorithms and on the quality of the resulting solutions. These 
experiments suggest (unsurprisingly) that solution quality improves as sample sizes increase, but 
also that optimization runs may converge in fewer iterations for larger sample sizes. Also, a 
balanced allocation of computational resources between the inner and outer Monte Carlo sums 
is important for computational efficiency. Arbitrarily increasing the inner-loop sample size, for 
instance, yields little improvement in solution quality when the outer-loop samples are too few. 
Our results also suggest that RM has a consistent performance advantage over SAA-BFGS, 
but this conclusion is necessarily problem-dependent. Instead of declaring one algorithm to be 
superior, our broader goal is to illustrate the differences between the two algorithms and provide 
some selection guidelines based on their properties. 

The SAA approach may provide more flexibility than SA, as it can be combined with any 
deterministic optimization algorithm, whereas the SA approach essentially specifies the form of 
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each optimization iteration. SAA's flexibility allows one to take advantage of problem structure: 
if realizations of the objective surface are known to be "well-behaved" and smooth, gradient- 
based algorithms such as BFGS can exploit this regularity, as in the present source inversion 
example. On the other hand, if the objective is not smooth, or if gradients are not available, 
some gradient-free deterministic algorithm may be more appropriate. Estimates of optimality 
gap, obtained from replicate SAA solutions, can be used to adaptively adjust the outer-loop 
Monte Carlo sample size, but are unsuitable for assessing the inner-loop sample size because of 
bias effects. Future work could employ the common random number stream approach in [40] 
to obtain a lower- variance estimate of optimality gap (along with a confidence interval), or the 
jackknife technique proposed in [69] for bias reduction. 

The RM algorithm and other stochastic approximation methods, on the other hand, must 
use a stochastic gradient estimator. This can lead to poor performance if only high- variance 
gradient estimates are available. In the current context, increasing the outer-loop sample size 
reduces variance and the RM algorithm performed relatively well. It is also worth noting that the 
frequent (yet cheaper) steps of RM effectively provide a finer resolution in run time than SAA, 
giving the user more freedom to terminate the algorithm without losing much progress between 
the termination time and the previous optimization iteration. Therefore, RM may become more 
attractive as objective evaluations become more expensive.^ 

The present approach used a global polynomial chaos surrogate, constructed over the product 
of the parameter space and the design space D. In model-based methods for deterministic 
derivative-free optimization, one might prefer to construct local surrogates valid over increasingly 
smaller intervals of D, particularly as one approaches the optimum. Pursuing similar ideas in the 
stochastic context could possibly offer additional accuracy, but sampling errors in the stochastic 
optimization solution will always limit potential gains. 

Finally, as we pointed out in Section 2.1, this paper has focused on batch or open-loop 
experimental design, where the parameters for all experiments are chosen before data are actually 
collected. An important target for future work is rigorous sequential or closed-loop design, where 
the data from one set of experiments are used to guide the choice of the next set. Here we 
expect stochastic optimization algorithms, for expected information gain and other objectives, to 
continue playing a crucial role. 
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^Even if a polynomial chaos expansion is used as a surrogate for the forward model, its evaluation can become 
expensive if the stochastic dimension and polynomial order are high, though it remains much cheaper than the original 
model. 
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8 Figures and Tables 




Figure 1: Example forward model solution and realizations from the likelihood. In particular, the solid 
line represents the time-dependent contaminant concentration w(x, t;Xsrc) at x = Xgcnsor = (0.0,0.0), 
given a source centered at Xgrc = (0.1,0.1), source strength s = 2.0, width h = 0.05, and shutoff 
time T = 0.3. Parameters are defined in the diffusion equation (19). The five crosses represent noisy 
measurements at five designated measurement times. 
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(c) N = 1001, M = 101 (d) N = 1001, M = 1001 



Figure 2: Surface plots of independent U^^m realizations, evaluated over the entire design space 
[0, 1]^ 9 d = {x,y). Note that the vertical axis ranges and color scales vary among the subfigures. 
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Figure 3: Contours of posterior probability density for the source location, given different sensor 
placements. The true source location, marked with a blue circle, is Xgrc = (0.09,0.22). 
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(c) N ^1, M = 101 (d) N = 1, M = 1001 

Figure 4: Sample paths of the RM algorithm with = 1, overlaid on Un,m surfaces from Figure 2 
with the corresponding M values. The large □ is the starting position and the large x is the final 
position. 
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(c) iV = 11, M = 101 (d) TV = 11, M = 1001 



Figure 5: Sample paths of the RM algorithm with = 11, overlaid on Un,m surfaces from Figure 2 
with the corresponding M values. The large □ is the starting position and the large x is the final 
position. 
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(c) N = 101, M = 101 (d) N = 101, M = 1001 

Figure 6: Sample paths of the RM algorithm with = 101, overlaid on Un^m surfaces from Figure 2 
with the corresponding M values. The large □ is the starting position and the large x is the final 
position. 
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(c) N = 1, M = 101 {d) N = 1, M = 1001 

Figure 7: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, 
with = 1. The large □ is the starting position and the large x is the final position. 
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(c) N = 11, M = 101 (d) N = 11, M = 1001 

Figure 8: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, 
with = 11. The large □ is the starting position and the large x is the final position. 
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(c) N = 101, M = 101 (d) N = 101, M = 1001 



Figure 9: Realizations of the objective function surface using SAA, and corresponding steps of BFGS, 
with = 101. The large □ is the starting position and the large x is the final position. 
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Table 1: Histograms of final search positions resulting from 1000 independent runs of RM (top 
subrows) and SAA (bottom subrows) over a matrix of and M sample sizes. For each histogram, 
the bottom-right and bottom-left axes represent the sensor coordinates x and respectively, while 
the vertical axis represents frequency. 
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Table 2: High-quality expected information gain estimates at the final sensor positions resulting from 
1000 independent runs of RM (top subrows, blue) and SAA-BFGS (bottom subrows, red). For each 
histogram, the horizontal axis represents values of f/j\/=iooi,Af=iooi and the vertical axis represents 
frequency. 
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Table 3: Histograms of optimality gap estimates for SAA-BFGS, over a matrix of samples sizes M 
and A^. For each histogram, the horizontal axis represents value of the gap estimate and the vertical 
axis represents frequency. 
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Table 4: Number of iterations in each independent run of RM (top subrows, blue) and SAA-BFGS 
(bottom subrows, red), over a matrix of sample sizes M and A^. For each histogram, the horizontal 
axis represents iteration number and the vertical axis represents frequency. 
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Figure 10: Mean square error, defined in (23), versus average run time for each optimization algorithm 
and various choices of inner-loop and outer-loop sample sizes. The highlighted curves are "optimal 
fronts" for RM (light red) and SAA-BFGS (light blue). 
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A Analytical Derivation of the Unbiased Gradient Es- 
timator 

In this section, we derive the analytical form of the unbiased gradient estimator \7UN,M{d, ^s, z^),^" 
following the method presented in Section 4. 

The estimator Ui\i^M{d,9s,Zs) is defined in (17). Its gradient in component form is 



^UN,Mid,6s,Zs) 



g^t^Af,M(d, Os,Zs 



-Un,m{^, ^s,Zs 



(24) 



where is the dimension of the design parameters d and da denotes the ath component of d. 
The ath component of the gradient is then 



d 
dda 



1 

N 



N ( a 

dda 



E 

1=1 



/Y|e,d (G(0»,d) + C(0«,d)z»|0W,d) 



/Y|0,d (G(0«, d) + C(0W, d)z« d) 
EfLi gj/Yie.d (G(0^'U) + C(0W,d)z(-) \e(^'^\d] 
' Ef=i/Y|e,d (G(0«,d) + C(0W,d)z« \6i^>r),d) 



■ (25) 



Partial derivatives of the likelihood function with respect to d are required above. We assume 
that each component of C{6^^\d) is of the form Oc + /3c|Gc(0^*\ d)|, c = 1 . . . Uy, where Uy is the 
dimension of the data vector Y, and ac, Pc are constants. Also, let the random vectors z^*) be 
mutually independent and composed of i.i.d. components, such that the data are conditionally 
independent given and d. The derivative of the likelihood function then becomes 

_d_ 
dda' 

d 



-fY\&,d (G(0«,d) + C(0«,d)z« 



d 



dda 

Uy 

E 
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,c=l 

y r Q 



n/>'c|©,d(G'c(0«,d) + («c + /3c|G,(0«,d)|). 



6>(*'J) d 



^/n,|e,d (G,(0«,d) + (afc + A|G,(0W,d)|)z« ^(^'■'^d 



n fYc\&,d (Gc(0«, d) + («c + /3c|Ge(0», d)|)z(^ 



6>(^'^') d 



c=l 



(26) 



Introducing a standard normal density for each Zc \ the likelihood associated with a single com- 
ponent of the data vector is 



/F.|©,d (G,(0«,d) + (a, + /3,|G,(0«,d)|); 



2TT («, + /3,|G,(0(^J),d)|) 



exp 



Gc(0(''J"), d) - (G,(0W, d) + (a, + /3,|Ge(0W, d)|)zW) 
2(a, + /?,|Ge(0(^^^),d)|)' 



7) 



^Recall that this estimator is unbiased with respect to the gradient of Um- 
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and its derivatives are 
d 



dda 

-/3,sign(G,(0(^'^),d))gJ^G,(0(*'j),d) 
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In cases where conditioning on is replaced by conditioning on 0^*^ (i-e., for the first summa- 
tion term in equation (25)), the expressions simpUfy to 



/y.|0,d(Gc(0«,d) + (ae + /3c|G,(0»,d)|)z«|0»,d) 



/2^(a, + /3e|Ge(0«,d)|) 



exp 



(29) 



and 



/y.|0,d(Gc(0(') , d) + (a, + /3eGe(0« , d))z« , d) 



dda 
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exp 



(30) 



We now require the derivative of each model output Gc with respect to d. In most cases, 
this quantity will not be available analytically. One could use an adjoint method to evaluate the 
derivatives, or instead employ a finite difference approximation, but embedding these approaches 
in a Monte Carlo sum may be prohibitive, particularly if each forward model evaluation is com- 
putationally expensive. The polynomial chaos surrogate introduced in Section 3 addresses this 
problem by replacing the forward model with polynomial expansions for either Gc 



G,(0«,d)« ^gb^b {i{e^\d) 



(31) 



or InGr 



Ge(0(*\d)«exp 



E^b^b (^(0(^\d) 



(32) 



(28) 
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Here gy^ are the expansion coefficients and is an admissible multi-index set indicating which 
polynomial terms are in the expansion. For instance, if uq is the dimension of 6 and is the 
dimension of d, such that ne + is the dimension of ^, then ^7 := {b G Nq : |b|i < p} is a 



total-order expansion of degree p. This expansion converges in the L sense as p — >• cxd. 

Consider the latter (In-Gc) case; here, the derivative of the polynomial chaos expansion is 



d 
dda 



Ge(0«,d) = exp 



(33) 



In the former (Gc without the logarithm) case, we obtain the same expression except without the 
exp [•] term. 

To complete the derivation, we assume that each component of the input parameters 
and design variables d is represented by an affine transformation of corresponding basis random 
variable S: 



dl'-ng = ll'+^v'^V, 



(34) 
(35) 



where 7(.-) and ^ are constants, and I = 1, . . . , ng and /' = ng + 1, . . . ,nQ + Ud- This is 
a reasonable assumption since S can be typically chosen such that their distributions are of the 
same family as the prior on 6 (or the uniform "prior" on d); this choice avoids any need for 
approximate representations of the prior. The derivative of ^'b(^(0^*\ d)) from equation (33) is 
thus 



dda 
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(4)) (36) 



and the derivative of the univariate basis function ijj with respect to da is 

d . . d 



{S.a+ng{da)) 



''Pba+ng i^a+rig) , Ca+ng{da) 
O^a+ng " Oda 
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(37) 



where the second equality is a result of using equation (35). The derivative of the polyno- 
mial basis function with respect to its argument is available analytically for many standard 
orthogonal polynomials, and may be evaluated using recurrence relationships [70]. For example, 
in the case of Legendre polynomials, the usual derivative recurrence relationship is -^ipniC) = 

[—b^ipniO + bipn-iiO] /(I ~ C^)) where n is the polynomial degree. However, division by (1 — ^^) 
presents numerical difficulties when evaluated on ^ that fall on or near the boundaries of the 
domain. Instead, a more robust alternative that requires both previous polynomial function 
and derivative evaluations can be obtained by directly differentiating the three-term recurrence 
relationship for the polynomial, and is preferable in practice: 
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MO 



2n - 1, 2n 



1 d 

— c-^„_i(e) 

n a4 
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1 d 



V'n-2(0- 



n n dt, n ok, 

This concludes the derivation of the analytical gradient of VC/Ar,M(d, Og, 



(38) 
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